
BUG: read_json does not respect chunksize #38293


Merged: 6 commits into pandas-dev:master on Dec 23, 2020

Conversation

robertwb
Contributor

@robertwb robertwb commented Dec 4, 2020

@mroeschke
Member

Thanks. Mind adding a unit test and whatsnew entry for version 1.2?
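A unit test for this behavior might look something like the following sketch (hypothetical data and assertions, not the exact test added in this PR), checking that chunks come back at the requested size and that reassembling them matches an unchunked read:

```python
from io import StringIO

import pandas as pd
from pandas.testing import assert_frame_equal

# JSON Lines input: one record per line
jsonl = '{"a": 1, "b": 2}\n{"a": 3, "b": 4}\n{"a": 5, "b": 6}\n{"a": 7, "b": 8}'

# With chunksize set, read_json returns an iterator of DataFrames
chunks = list(pd.read_json(StringIO(jsonl), lines=True, chunksize=2))

# Each chunk should have at most `chunksize` rows; 4 lines -> 2 chunks of 2
assert len(chunks) == 2
assert all(len(chunk) == 2 for chunk in chunks)

# Concatenating the chunks should reproduce the unchunked read
expected = pd.read_json(StringIO(jsonl), lines=True)
assert_frame_equal(pd.concat(chunks), expected)
```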

@mroeschke mroeschke added the IO JSON read_json, to_json, json_normalize label Dec 6, 2020
@jreback jreback added the Bug label Dec 8, 2020
@jreback jreback added this to the 1.2 milestone Dec 8, 2020
Contributor

@jreback jreback left a comment


yeah, ideally this change should have broken a test; obviously we don't have one. Please add a test which breaks on master and which this change fixes (look in the original issues for what we want to replicate).

@jreback jreback removed this from the 1.2 milestone Dec 13, 2020
@robertwb
Contributor Author

I've added a test and addressed the comments.

@jreback jreback added this to the 1.2 milestone Dec 22, 2020
@jreback
Contributor

jreback commented Dec 22, 2020

can you also post the benchmarks for line-based read_json? (should show that this improves vs master)

@robertwb
Contributor Author

I don't expect it to improve raw performance; it's about memory usage for large files.
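The memory point above can be illustrated with a small sketch (hypothetical in-memory data standing in for a large file): with `chunksize` set, `read_json` yields DataFrames one chunk at a time, so only one chunk needs to be resident rather than the whole parsed frame.

```python
from io import StringIO

import pandas as pd

# Stand-in for a large JSON Lines file; real use would pass a file path.
jsonl = "\n".join('{"x": %d}' % i for i in range(1000))

total = 0
# read_json with chunksize returns an iterator of DataFrames, so each
# chunk can be processed and discarded before the next one is parsed.
for chunk in pd.read_json(StringIO(jsonl), lines=True, chunksize=100):
    total += int(chunk["x"].sum())

assert total == sum(range(1000))  # 499500
```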

@jreback
Contributor

jreback commented Dec 22, 2020

> I don't expect it to improve raw performance; it's about memory usage for large files.

we have memory benchmarks to measure this

@jreback
Contributor

jreback commented Dec 23, 2020

@robertwb if you can post the memory benchmarks

@simonjayhawkins
Member

The output is BENCHMARKS NOT SIGNIFICANTLY CHANGED, but IIUC io.json.ReadJSONLines.peakmem_read_json_lines_concat is reduced from 206M to 188M.

maybe we need to change the benchmark to see a bigger effect.

(pandas-dev) simon@WIN-7LJRFMBC2ME:~/pandas/asv_bench$ asv continuous -f 1.1 master HEAD -b peakmem_read_json
· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py3.8-Cython0.29.21-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt.
·· Building 4b856ee8 <json-chunksize> for conda-py3.8-Cython0.29.21-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt.........................................
·· Installing 4b856ee8 <json-chunksize> into conda-py3.8-Cython0.29.21-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt.
· Running 6 total benchmarks (2 commits * 1 environments * 3 benchmarks)
[  0.00%] · For pandas commit 4b856ee8 <json-chunksize> (round 1/1):
[  0.00%] ·· Benchmarking conda-py3.8-Cython0.29.21-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 16.67%] ··· io.json.ReadJSONLines.peakmem_read_json_lines                                                           ok
[ 16.67%] ··· ========== ======
                index
              ---------- ------
                 int      250M
               datetime   250M
              ========== ======

[ 33.33%] ··· io.json.ReadJSONLines.peakmem_read_json_lines_concat                                                    ok
[ 33.33%] ··· ========== ======
                index
              ---------- ------
                 int      188M
               datetime   188M
              ========== ======

[ 50.00%] ··· io.json.ReadJSONLines.peakmem_read_json_lines_nrows                                                     ok
[ 50.00%] ··· ========== ======
                index
              ---------- ------
                 int      188M
               datetime   188M
              ========== ======

[ 50.00%] · For pandas commit 2733a109 <master> (round 1/1):
[ 50.00%] ·· Building for conda-py3.8-Cython0.29.21-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt......................................
[ 50.00%] ·· Benchmarking conda-py3.8-Cython0.29.21-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 66.67%] ··· io.json.ReadJSONLines.peakmem_read_json_lines                                                           ok
[ 66.67%] ··· ========== ======
                index
              ---------- ------
                 int      248M
               datetime   248M
              ========== ======

[ 83.33%] ··· io.json.ReadJSONLines.peakmem_read_json_lines_concat                                                    ok
[ 83.33%] ··· ========== ======
                index
              ---------- ------
                 int      206M
               datetime   206M
              ========== ======

[100.00%] ··· io.json.ReadJSONLines.peakmem_read_json_lines_nrows                                                     ok
[100.00%] ··· ========== ======
                index
              ---------- ------
                 int      195M
               datetime   195M
              ========== ======


BENCHMARKS NOT SIGNIFICANTLY CHANGED.

@jreback
Contributor

jreback commented Dec 23, 2020

> the output is BENCHMARKS NOT SIGNIFICANTLY CHANGED. but IIUC io.json.ReadJSONLines.peakmem_read_json_lines_concat is reduced from 206M to 188M
>
> maybe we need to change the benchmark to see a bigger effect.

so how do we know that this is working (and the previous was not)? Sure, this is a difference, but wouldn't you expect a HUGE difference here? (e.g. maybe the benchmark should be reading with chunksize=1 or something vs no chunksize)

@simonjayhawkins
Member

> so how do we know that this is working (and the previous was not)? Sure, this is a difference, but wouldn't you expect a HUGE difference here? (e.g. maybe the benchmark should be reading with chunksize=1 or something vs no chunksize)

We do a concat of the chunks in the benchmark; maybe we could just iterate through the chunks without the concat.

I'm currently running with chunksize=1 to compare.
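The effect being discussed can be eyeballed outside asv with a rough sketch using `tracemalloc` (not the asv harness itself, and the numbers will not match asv's peakmem figures): iterating the chunks without a concat keeps only one chunk alive at a time, while the unchunked read materializes the whole frame.

```python
import tracemalloc
from io import StringIO

import pandas as pd

jsonl = "\n".join('{"v": %d}' % i for i in range(50000))

def peak_bytes(fn):
    """Return the peak traced allocation while running fn."""
    tracemalloc.start()
    fn()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak

# Unchunked: the whole frame is materialized at once
full = peak_bytes(lambda: pd.read_json(StringIO(jsonl), lines=True))

# Chunked, no concat: only one chunk alive at a time
def chunked():
    for _ in pd.read_json(StringIO(jsonl), lines=True, chunksize=1000):
        pass

partial = peak_bytes(chunked)
print(f"full: {full} bytes, chunked (no concat): {partial} bytes")
```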

@jreback
Contributor

jreback commented Dec 23, 2020

> > so how do we know that this is working (and the previous was not)? Sure, this is a difference, but wouldn't you expect a HUGE difference here? (e.g. maybe the benchmark should be reading with chunksize=1 or something vs no chunksize)
>
> we do a concat of the chunks in the benchmark. maybe we could just iterate through the chunks without the concat.
>
> am currently running with chunksize=1 to compare.

oh yeah the concat will increase memory :-> don't compare that

@simonjayhawkins
Member

changing the chunksize to 10 (1 fails) makes a bit more difference (to the memory used on master!)

       before           after         ratio
     [2733a109]       [d731dc87]
     <master>         <json-chunksize>
-            224M             189M     0.84  io.json.ReadJSONLines.peakmem_read_json_lines_concat('int')
-            224M             188M     0.84  io.json.ReadJSONLines.peakmem_read_json_lines_concat('datetime')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

@simonjayhawkins
Member

> oh yeah the concat will increase memory :-> don't compare that

Presumably we do the concat to compare the results with peakmem_read_json_lines:

    def peakmem_read_json_lines(self, index):
        read_json(self.fname, orient="records", lines=True)
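The role of the concat in the benchmark can be shown with a minimal standalone sketch (hypothetical data, not the asv code): concatenating the chunks rebuilds the full frame, which is both why the result can be compared against the unchunked read and why the concat variant shows a higher peak than pure iteration would.

```python
from io import StringIO

import pandas as pd

jsonl = '{"a": 1}\n{"a": 2}\n{"a": 3}\n{"a": 4}'

# Concatenating the chunks materializes the full frame again, raising
# peak memory, but makes the result comparable to the unchunked read.
result = pd.concat(pd.read_json(StringIO(jsonl), lines=True, chunksize=2))
expected = pd.read_json(StringIO(jsonl), lines=True)

pd.testing.assert_frame_equal(result, expected)
```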

@jreback
Contributor

jreback commented Dec 23, 2020

ok, fair. @simonjayhawkins I wouldn't object to changing the benchmarks (but this PR is fine).

@jreback jreback merged commit 936d125 into pandas-dev:master Dec 23, 2020
@jreback
Contributor

jreback commented Dec 23, 2020

thanks @robertwb

@jreback
Contributor

jreback commented Dec 23, 2020

@meeseeksdev backport 1.2.x

meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Dec 23, 2020